Entropic analysis of the role of words in literary texts 
Beyond the local constraints imposed by grammar, words concatenated in long sequences carrying
a complex message show statistical regularities that may reflect their linguistic role in the message. 
In this paper, we perform a systematic statistical analysis of the use of words in literary English 
corpora. We show that there is a quantitative relation between the role of content words in literary 
English and the Shannon information entropy defined over an appropriate probability distribution. 
Without assuming any previous knowledge about the syntactic structure of language, we are able 
to cluster certain groups of words according to their specific role in the text. 



Language is probably the most complex function of our 
brain. Its evolutionary success has been attributed to 
the high degree of combinatorial power derived from its 
fundamental syntactic structure. Syntactic rules act
locally at the sentence level and do not necessarily account
for higher levels of organisation in large sequences
of words conveying a coherent message. In this respect,
the situation is similar to that found in other natural
sequences with non-trivial information content, such
as the genetic code, where more than one structural level
may be discerned. In the case of human language,
complex hierarchies have been revealed at levels ranging
from word form to sentence structure. Moreover,
it has been argued that long samples of continuous
written or spoken language also possess a hierarchical
macrostructure at levels beyond the sentence. In a
coarse grained splitting of this complex organisation we 
could distinguish three basic structural levels in the anal- 
ysis of long language records. The first one corresponds 
to the absolute quantitative occurrence of words, that is, 
which words are used and how many times each. The sec- 
ond level of organisation refers to the particular ways in 
which words can be linked into sentences according to the 
syntactic rules of language. Finally, at the highest level, 
grammatical sentences are combined in order to thread
a meaningful message as part of a communication process.
This assemblage of sentences into more complex
structures is not strictly framed by a set of precise pre- 
scriptions and is more related to the particular nature of 
the message conveyed by the sequence. In this paper we 
shall focus on the statistical manifestations of this high 
level of organisation in language. By means of an en- 
tropic measure of word distribution in literary corpora, 
we show that the statistical realisation of words within 
a complex communicative structure reflects systematic 
patterns which can be used to cluster words according to 
their specific linguistic role. 

Zipf's analysis represents the crudest statistical approach
by which some quantitative information about
the use of words in a corpus of written language can
be obtained. Basically, it consists of counting the num- 
ber of occurrences of each different word in the corpus, 
and then producing a list of these words sorted accord- 
ing to decreasing frequency. The rank-frequency distri- 
bution thus obtained presents robust quantitative regu- 
larities that have been tested over a vast variety of natu- 
ral languages. However, the frequency-ordered list alone 
bears little information on the particular role of words in 
the lexicon, as can be realised by noting that after shuf- 
fling the corpus the rank-frequency distribution remains 
intact. Naturally, the first ranks in the list belong to 
the commonest words in the language style of the source 
text, e.g. function words and some pronouns in literary 
English. After them, words related to the particular con- 
tents of the text start to appear. An illustration is given 
in Table I, where we show some portions of the first ranks
in Zipf's classification of the words from William Shakespeare's
Hamlet.
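
As a rough illustration of the counting procedure just described, the following Python sketch builds a rank-frequency list from a string of text. The function name `rank_frequency` and the crude tokenisation rule are assumptions made for this example only; the text does not specify how words were delimited.

```python
import re
from collections import Counter


def rank_frequency(text):
    """Return (rank, word, count) tuples sorted by decreasing count.

    Tokenisation is an assumption of this sketch: lowercase runs of
    letters and apostrophes are treated as words.
    """
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    # most_common() orders by decreasing count; ties keep first-seen order.
    return [
        (rank, word, count)
        for rank, (word, count) in enumerate(counts.most_common(), start=1)
    ]


sample = "the king and the queen and the prince"
table = rank_frequency(sample)
for rank, word, count in table:
    print(rank, word, count)
```

Because only the counts enter, any reordering of the words in `sample` yields the same list of counts, which is the shuffling invariance noted above.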

It is therefore clear that in order to extract information 
about the specific role of words by statistical analysis, we 
must be able to gauge not only how often a word is used 
but also where it is used in the text. A statistical measure
that fulfils the aforementioned requirement can be
constructed from a suitable adaptation of the Shannon
information entropy. Let us think of a given text
corpus as made up of the concatenation of P individual 
parts. The kind of partitions we are going to consider 
here are those that arise naturally at different scales as 
a consequence of the global structure of literary corpora. 
Examples of these natural divisions are the individual 
books of an author's whole production, and the collection
of chapters in a single book. Calling $N_i$ the total
number of words in part i, and $n_i$ the number of occurrences
of a given word in that part, the ratio $f_i = n_i/N_i$
gives the frequency of appearance of the word in question
in part i. For each word, it is possible to define a
probability measure $p_i$ over the partition as

$$ p_i = \frac{f_i}{\sum_{j=1}^{P} f_j} . \tag{1} $$

The quantity $p_i$ stands for the probability of finding the
word in part i, given that it is present in the corpus.
The Shannon information entropy associated with the
discrete probability distribution $p_i$ reads

$$ S = -\frac{1}{\ln P} \sum_{i=1}^{P} p_i \ln p_i . \tag{2} $$

Generally, the value of S is different for each word. As
discussed below, the entropy of a given word provides a
characterisation of its distribution over the different parts.
Note that, independently of the specific values of
$p_i$, we have $0 \le S \le 1$.

To gain insight into the kind of measure represented by
S, two limiting cases are worth mentioning. If a given
word is uniformly distributed over the P parts, $p_i = 1/P$
for all i and equation (2) yields S = 1. Conversely, if a
word appears in part j only, we have $p_j = 1$ and $p_i = 0$ for
$i \neq j$, so that S = 0. These examples represent extreme
real cases in the distribution of words. In a first approx- 
imation one expects that certain words are evenly used 
throughout the text regardless of the specific contents 
of the different parts. Possible candidates are given by 
function words, such as articles and prepositions, whose 
use is only weakly affected by the specific character of the 
different parts in a homogeneous corpus. Other words, 
associated with more particular aspects of each part may 
fluctuate considerably in their use, thus having lower val- 
ues of the entropy. We show in the following that in just a 
few statistical quantities such as frequency and entropy 
there is relevant information about the role of certain 
word classes. 
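
The entropy measure and its two limiting cases can be checked numerically. The sketch below assumes the normalised form S = -(1/ln P) Σ p_i ln p_i, with p_i obtained by normalising the per-part frequencies n_i/N_i; the function name `word_entropy` and the toy numbers are hypothetical and serve only as an illustration.

```python
import math


def word_entropy(occurrences, part_sizes):
    """Normalised entropy S of one word over the P parts of a corpus.

    occurrences[i] is the number of times the word appears in part i,
    part_sizes[i] is the total number of words in part i.
    """
    P = len(part_sizes)
    if P < 2 or len(occurrences) != P:
        raise ValueError("need matching lists with at least two parts")
    # Frequency of the word in each part, f_i = n_i / N_i.
    freqs = [n / N for n, N in zip(occurrences, part_sizes)]
    total = sum(freqs)
    if total == 0:
        raise ValueError("the word does not occur in the corpus")
    # Probability of finding the word in part i, given it is in the corpus.
    probs = [f / total for f in freqs]
    # Terms with p_i = 0 contribute nothing (p ln p -> 0).
    return -sum(p * math.log(p) for p in probs if p > 0) / math.log(P)


sizes = [100, 200, 400, 300]
uniform = word_entropy([2, 4, 8, 6], sizes)   # same frequency in every part
single = word_entropy([0, 7, 0, 0], sizes)    # present in one part only
mixed = word_entropy([1, 9, 2, 0], sizes)     # somewhere in between
print(uniform, single, mixed)
```

Note that the uniform case is uniform in frequency, not in raw counts: the division by the part sizes is what makes parts of different length comparable.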

Figure 1 shows $1 - S$, with S calculated as in equation
(2), versus the number of occurrences n, for each different
word in a corpus made up of 36 plays by William Shakespeare.
The total number of words in the set of plays
adds up to 885,535 with a vocabulary of 23,150 different
words. In this case, the natural division that we are
considering is given by the individual plays. The struc- 
ture of the graph calls for two different levels of analysis. 
First, its most evident feature, that is the tendency of the 
entropy to increase with n, represents a general trend of 
the data which should be explained as a consequence of 
basic statistical facts. In qualitative terms, it implies 
that on average the more frequent a word is, the more
uniformly it is used. Second, a somewhat deeper analysis
may be required in order to reveal whether the individ- 
ual deviations from this general trend are related to the 
particular usage nuances of words, as imposed by their 
specific role in the text. Whereas most methods of word
clustering according to predefined classes rely heavily on
a certain amount of pre-processing, such as tagging words
as members of particular grammatical categories, we
shall address this point without any a priori linguistic
knowledge, save the mere identification of words as the
minimal structural units of language.

In order to clarify to what extent the features observed
in figure 1 reflect basic statistical properties of the distribution
of words over the different parts of the corpus, we
performed a simple numerical experiment which consists
in generating a random version of the 36 Shakespeare
plays. This was done in the following steps. First, we
considered a list of all the words used in the plays, each
appearing exactly the number of times it was used in
the real corpus. Second, we shuffled the list, thus completely
destroying the natural order of words. Third, we
took the words one by one from the list and wrote a random
version of each play containing the same number of
words as its real counterpart. In figure 2, we compare
the randomised version with the data of figure 1. It is
evident that, on the one hand, the tendency of the entropy
to grow with n is preserved. On the other, the large fluc- 
tuations in the value of S, as well as the presence of rela- 
tively infrequent words with very low entropy, are totally 
erased in the randomised version. On average, words 
have higher entropies in the random realisation than in 
the actual corpus. Indeed, this is what one would ex- 
pect for certain word classes such as proper nouns and, 
in general, for content words that allude to objects, sit- 
uations or actions related to specific parts of the corpus. 
All the inhomogeneities that characterise the use of such 
words disappear in the random version, and consequently 
render higher values of the entropy. 
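
The three steps of the randomisation experiment described above can be sketched as follows. This is a minimal illustration, assuming each part of the corpus is already available as a list of words; the function name `randomise_corpus` and the `seed` parameter are hypothetical.

```python
import random


def randomise_corpus(parts, seed=None):
    """Return a shuffled copy of the corpus with the same part sizes.

    parts is a list of parts, each part being a list of words.
    """
    rng = random.Random(seed)
    # Step 1: one list with every word, repeated as often as it occurs.
    pool = [word for part in parts for word in part]
    # Step 2: shuffle the list, destroying the original word order.
    rng.shuffle(pool)
    # Step 3: deal the words back out, part by part, keeping each
    # part's original length.
    shuffled_parts = []
    start = 0
    for part in parts:
        shuffled_parts.append(pool[start:start + len(part)])
        start += len(part)
    return shuffled_parts


corpus = [["to", "be", "or", "not", "to", "be"], ["the", "king"], ["to", "the", "queen"]]
random_corpus = randomise_corpus(corpus, seed=1)
print(random_corpus)
```

The overall word counts are untouched by construction, so the rank-frequency list of the randomised corpus is identical to that of the real one; only the distribution of each word over the parts changes.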

Besides its value as a comparative benchmark, the ran- 
dom version of the corpus has the appeal of being ana- 
lytically tractable, at least in a slightly modified form, 
as follows. Let us suppose that we have a corpus of
N words consisting of P parts, with $N_i$ words in part
i (i = 1, ..., P). The probability that a word appears $n_1$
times in part 1, $n_2$ times in part 2, and so on, is

$$ p(n_1, \ldots, n_P) = n! \prod_{j=1}^{P} \frac{1}{n_j!} \left( \frac{N_j}{N} \right)^{n_j} , \tag{3} $$

with $n = \sum_j n_j$. In the special case where all the parts
have exactly the same number of words, i.e. $N_i = N/P$
for all i, the average value of the entropy resulting from
the probability given by equation (3) can be written in
terms of n only, as

$$ \langle S(n) \rangle = -\frac{P}{\ln P} \sum_{m=1}^{n} \frac{m}{n} \ln\!\left( \frac{m}{n} \right) \binom{n}{m} \frac{1}{P^{m}} \left( 1 - \frac{1}{P} \right)^{n-m} . \tag{4} $$

For highly frequent words, $n \gg 1$, equation (4) assumes
a particularly simple form, namely

$$ \langle S(n) \rangle \approx 1 - \frac{P-1}{2 n \ln P} . \tag{5} $$

The curve in figure 2 stands for the function $1 - \langle S(n) \rangle$,
with $\langle S(n) \rangle$ given by equation (4) with P = 36. First, we
note that despite the fact that $\langle S(n) \rangle$ was calculated assuming
that all the parts have the same number of words,
its agreement with the random realisation over the whole frequency
range is very good. Moreover, it can be seen that
after a short transient in the region of low n, $1 - \langle S(n) \rangle$
soon develops the asymptotic form given by equation (5),
a straight line of slope -1 in this log-log plot.
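
The approach to the asymptotic regime can be examined numerically. The sketch below assumes equal-sized parts, so that the number of occurrences of a word in any one part is binomially distributed with probability 1/P, and compares the resulting average entropy with the large-n form 1 - (P - 1)/(2 n ln P). Both function names are hypothetical, and the binomial weights are evaluated through `math.lgamma` only to avoid overflow for large n.

```python
import math


def mean_entropy_exact(n, P):
    """Average entropy of a word with n occurrences in a random corpus
    of P equal-sized parts (binomial average over one part, times P)."""
    if n < 1 or P < 2:
        raise ValueError("need n >= 1 and P >= 2")
    q = 1.0 / P
    total = 0.0
    for m in range(1, n + 1):
        # log of C(n, m) q^m (1 - q)^(n - m), computed in log space.
        log_weight = (
            math.lgamma(n + 1) - math.lgamma(m + 1) - math.lgamma(n - m + 1)
            + m * math.log(q) + (n - m) * math.log(1.0 - q)
        )
        x = m / n
        total += x * math.log(x) * math.exp(log_weight)
    return -P * total / math.log(P)


def mean_entropy_asymptotic(n, P):
    """Large-n approximation of the average entropy."""
    return 1.0 - (P - 1) / (2.0 * n * math.log(P))


P = 36
for n in (1, 10, 100, 1000, 5000):
    print(n, 1.0 - mean_entropy_exact(n, P), 1.0 - mean_entropy_asymptotic(n, P))
```

For n = 1 the exact average vanishes, as it must for a word that occurs only once, whereas the asymptotic expression is meaningful only once n is well above P.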

We have so far explained the general trend in the behaviour
of the entropy with simple statistical considerations
pertaining to the distribution of words over the different
parts of the corpus. In the second part of our analysis
we shall address the more interesting question of how the 
information contained in the entropy may reveal natural 
groupings of English words according to their particu- 
lar role in the text. In order to accomplish this task 
it proves useful to define adequate coordinates whereby 
words can be associated with points in a suitable space.
As a consequence of that, the classification of words ac- 
cording to their role should emerge naturally from the 
preference of certain words to occupy more or less defi- 
nite regions of that space. As one of these coordinates 
we take a quantity reflecting the degree of use of a word 
in the text, namely, the number of occurrences n. The 
second coordinate is introduced to measure the devia- 
tion of the entropy of each word from the value predicted 
by the random-corpus model, as follows. We have seen 
that equation (5) accounts for the expected statistical
decrease in the value of the entropy as a word becomes
less frequent. This effect can be separated from the behaviour
of the words in the real texts, in order to reveal
genuine information on the linguistic usage of words. We
rewrite equation (5) as

$$ \left( 1 - \langle S(n) \rangle \right) n \approx \frac{P-1}{2 \ln P} , \tag{6} $$

where the right-hand side is independent of n. Therefore,
in a graph of $(1 - S)n$ versus n, the words whose actual
distribution agrees with the random-corpus model should
approximately fall along a horizontal line. All appreciable
departures from this line should be expected to bear
some relation to the non-random character of the usage
of words, and therefore may reflect actual linguistic information.
By means of relation (6) we are therefore able to
filter out all the trivial part of the statistical behaviour.
Figure 3 shows actual data for the Shakespeare corpus in
a plot of $(1 - S)n$ versus n. The horizontal line stands
for the value given in the right-hand side of equation (6)
for P = 36.
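
For orientation, the height of the horizontal line and the rescaled coordinate can be computed directly. This sketch assumes the large-n random-corpus value (P - 1)/(2 ln P) for the baseline; the function names are hypothetical.

```python
import math


def random_corpus_baseline(P):
    """Expected value of (1 - S) n for frequent, uniformly used words."""
    return (P - 1) / (2.0 * math.log(P))


def rescaled_deviation(S, n):
    """Second coordinate of a word: (1 - S) n."""
    return (1.0 - S) * n


baseline = random_corpus_baseline(36)
print(round(baseline, 2))  # about 4.88 for P = 36

# A word confined to a single part has S = 0 and sits on the identity line.
print(rescaled_deviation(0.0, 250))
```

Words well above the baseline are less uniformly distributed than the random-corpus model predicts, up to the limit (1 - S) n = n reached when S = 0.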

In order to reveal whether the words show some sort of 
systematic distribution over the plane according to their 
linguistic role, we proceeded to classify by hand the first
2,000 words into different sets. The classification stops
there because words close to that rank occur
in the whole corpus a number of times similar to
the number of parts in the corpus division, $n \sim P$, thus
representing a limit beyond which statistical fluctuations
start to dominate. The six categories we considered
were the following: (a) proper nouns, (b) pronouns, (c)
nouns referring to humans, such as soldier and brother,
(d) nouns referring to nobility status, such as King and
Duke, which have a relevant place in Shakespeare's plays,
(e) common nouns (not referring to humans or to nobility
status) and adjectives, and finally (f) verbs and adverbs.
In case of ambiguity about the inclusion of a word in
a certain class we simply left it out and did not classify
it; hence the total number of classified words was finally
around 1,400.

The results of this classification can be seen in Figure
4, and in fact reveal a marked clustering of words over
definite regions of the two-dimensional space spanned
by $(1 - S)n$ and n. The sharpest distribution, shown in
Figure 4a, corresponds to proper nouns. These words occupy
a dense and elongated region which is limited from
above by the straight line representing the identity function
$(1 - S)n = n$. Naturally, proper nouns are expected
to define a class of words strongly related to particular
parts of the corpus. In consequence, their entropies tend
to be very low on average, if not strictly zero as in the
case of many proper nouns appearing in just one of
Shakespeare's plays. Thus, in a graph of $(1 - S)n$ versus
n, words having values of the entropy close to zero
have $(1 - S)n \approx n$ and fall close to the identity function.

The distribution of other word classes is less obvious. 
Verbs and adverbs (Fig. 4f) are closest to the random
distribution, covering a wide range of ranks. On average,
common nouns and adjectives (Fig. 4e) are farther from
the random distribution and, at the same time, are less
frequent. Nouns referring to humans (Fig. 4c) cover approximately
the same frequencies as common nouns, but
their distribution is typically more heterogeneous. The
entropy of some words in this class is, in fact, quite close
to zero. The three most frequent nouns in the Shakespeare
corpus are Lord, King, and Sir. All of them belong
to the class of nouns referring to nobility status (Fig.
4d), which spans a large interval of frequencies and has
systematically low entropies. The specificity of nobility
titles with respect to the different parts of the corpus can
be explained with essentially the same arguments as for
proper nouns. Considerably more surprising is the case
of pronouns (Fig. 4b) which, as expected, are highly frequent,
but whose entropies reveal a markedly nonuniform
distribution over the corpus. The origin of this heterogeneity
in the distribution of pronouns is not at all clear,
and deserves further investigation. In Fig. 5 we have
drawn together the zones occupied by all the classes to
make clearer their relative differences in frequency
and homogeneity.

We have performed the same statistical analysis over 
other literary corpora, obtaining totally consistent re- 
sults. The same organisation of words was observed in 
the works of Charles Dickens and Robert Louis Stevenson.
In particular, nouns and adjectives tend to be
more heterogeneously distributed than verbs and ad- 
verbs. Nouns referring to humans have systematically 
lower entropies. Pronouns, in turn, exhibit an unexpect- 
edly heterogeneous distribution for their high frequencies. 

In summary, in this work we have concentrated on the 
statistical analysis of language at a high level of its struc- 
tural hierarchy, beyond the local rules defined by sentence 
grammar. We started off by introducing an adequate 
measure of the entropy of words in a text corpus made 
up of a number of individual parts. With respect to Zipf's
analysis, which focuses on the frequency distribution of 
words, the study of entropy provides a second degree of 
freedom that resolves the statistical behaviour of words 
in connection with their linguistic role. By means of our 
random-corpus model we were able to extract the non- 
trivial part of the distribution of words. This procedure 
reveals statistical regularities in the distribution that can
be used to cluster words according to their role in the cor- 
pus without assuming any a priori linguistic knowledge. 
Ultimately, such regularities should stand as a manifes- 
tation of long-range linguistic structures inherent to the 
communication process. We believe that a thorough ex- 
planation of the origin of these global structures in lan- 
guage may eventually contribute to the understanding of 
the psycholinguistic basis for the modelling of reality by
the brain. 

Critical reading of the text by Susanna Manrubia is 
gratefully acknowledged. 

word      rank   number of occurrences

the          1   1087
and          2    968
to           3    760
of           4    669
I            5    633
a            6    567
you          7    558
...
Lord        25    225
he          26    224
be          27    223
what        28    219
King        29    201
him         30    197
...
Queen       42    120
our         43    120
if          44    117
or          45    115
shall       46    114
Hamlet      47    112

TABLE I. Rank classification of words from Shakespeare's
Hamlet.

FIG. 1. Plot of 1 - S versus the number of occurrences n for each word of a corpus made up of 36 plays by William
Shakespeare. The total number of words is 885,535 and the number of different words is 23,150.

FIG. 2. Comparison between the data shown in Figure 1 (black dots) and a randomised version of the Shakespeare corpus
(grey dots). The curve stands for the analytical approximation for the random corpus, equation (4).

FIG. 3. Plot of (1 - S)n vs. n for all the different words in the Shakespeare corpus. The horizontal line shows the expected
value of (1 - S)n for frequent, uniformly distributed words, as given by equation (6).

FIG. 4. Plot of (1 - S)n vs. n for six relevant word classes: (a) proper nouns, (b) pronouns, (c) nouns referring to humans,
(d) nouns referring to nobility status, (e) common nouns (not referring to humans or to nobility status) and adjectives, and (f)
verbs and adverbs. In each plot, the horizontal dotted line stands for the asymptotic value of (1 - S)n for the random-corpus
model, equation (6). Words close to this line are homogeneously distributed over the corpus. The oblique dotted line corresponds
to S = 0. Proximity to this line indicates extreme inhomogeneity in the distribution.

FIG. 5. Schematic combined representation of the zones occupied by the word classes of figure 4: proper nouns (a); pronouns
(b); common nouns and adjectives, including those referring to humans but not to nobility status (c); nouns referring to nobility
status (d); verbs and adverbs (e).